Top 10 things I learned from the book Effective Pandas by Matt Harrison

Pandas

Author: Adam Cseresznye

Published: August 2, 2023

Effective Pandas by Matt Harrison is a guide to the Pandas library, a powerful Python tool for data manipulation and analysis. The book covers a wide range of topics, from basic data structures to advanced techniques for data cleaning, transformation, and visualization.

I have found Effective Pandas to be a captivating and thought-provoking read. The book offers a genuinely unique take on data wrangling, placing great emphasis on method chaining and on lambda functions. I found these ideas so useful and practical that I have revisited the book several times just to keep them fresh in my memory; I must have read it cover to cover at least three or four times.

In this article, I will share the top 10 (+1) things I learned from Effective Pandas. These are the concepts and techniques that I found to be the most useful and practical.

We will use the Real World Smartphone’s Dataset by Abhijit Dahatonde from Kaggle. Let’s get to it.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)

Load the dataset

Code
df=pd.read_csv('smartphones.csv')
# some info about the dataframe, such as dimensions and dtypes of columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 980 entries, 0 to 979
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   brand_name                 980 non-null    object 
 1   model                      980 non-null    object 
 2   price                      980 non-null    int64  
 3   avg_rating                 879 non-null    float64
 4   5G_or_not                  980 non-null    int64  
 5   processor_brand            960 non-null    object 
 6   num_cores                  974 non-null    float64
 7   processor_speed            938 non-null    float64
 8   battery_capacity           969 non-null    float64
 9   fast_charging_available    980 non-null    int64  
 10  fast_charging              769 non-null    float64
 11  ram_capacity               980 non-null    int64  
 12  internal_memory            980 non-null    int64  
 13  screen_size                980 non-null    float64
 14  refresh_rate               980 non-null    int64  
 15  num_rear_cameras           980 non-null    int64  
 16  os                         966 non-null    object 
 17  primary_camera_rear        980 non-null    float64
 18  primary_camera_front       975 non-null    float64
 19  extended_memory_available  980 non-null    int64  
 20  resolution_height          980 non-null    int64  
 21  resolution_width           980 non-null    int64  
dtypes: float64(8), int64(10), object(4)
memory usage: 168.6+ KB

Tip #1: Use .assign more extensively

The assign method is one of the most versatile tools in Pandas: it can create new columns, overwrite existing ones, or do both in a single call.

One of its biggest benefits is that it fits naturally into method chaining, so several derived columns can be added in one concise, readable step. Another benefit is that it sidesteps the infamous SettingWithCopyWarning, which is typically triggered when you assign to a column of a DataFrame that might be a view of another DataFrame. Because assign always returns a new DataFrame, the warning simply never comes up.
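
To make the benefit concrete, here is a minimal sketch on a toy DataFrame (my own example, not the smartphone dataset): the first version assigns into a filtered slice and may trigger the warning, while the chained assign version does not.

Code
# toy data, purely for illustration
toy = pd.DataFrame({'price': [100, 200], 'brand': ['acme', 'blu']})

# in-place style: assigning into a slice of another frame may raise SettingWithCopyWarning
subset = toy[toy.price > 50]
subset['brand'] = subset['brand'].str.capitalize()

# chained style: .assign returns a fresh DataFrame, so no warning and no temporary names
(toy
 .loc[lambda df: df.price > 50]
 .assign(brand=lambda df: df.brand.str.capitalize())
)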

Problem statement: Let’s say we would like to capitalize the brand names in the brand_name column and calculate the Pixels Per Inch (PPI) of each screen. Following the equation described on Wikipedia’s Pixel density page, PPI is the diagonal resolution in pixels, sqrt(width² + height²), divided by the diagonal screen size in inches.

Code
(df
 .assign(brand_name=lambda df: df.brand_name.str.capitalize(), # capitalizes the brand names 
         PPI=lambda df: (np.sqrt(np.square(df.resolution_height) + np.square(df.resolution_width))
                         .div(df.screen_size)
                         .round(1)
                        )
        )
 .loc[:, ['brand_name','model','PPI']]
 .sort_values(by='PPI',ascending=False)
 .head(5)
)
brand_name model PPI
689 Sony Sony Xperia 1 IV (12GB RAM + 512GB) 642.6
696 Sony Sony Xperia Pro-I 642.6
688 Sony Sony Xperia 1 II 642.6
655 Samsung Samsung Galaxy S20 566.0
656 Samsung Samsung Galaxy S20 5G 566.0

Tip #2: Simplify the management of multiple if-else conditions using np.select

If our goal is to weave if-else logic into a chain, we can do so with either .mask or .where. But what approach should we adopt when we need to evaluate multiple conditions instead of just two? In such situations we have two options: chain successive .where (or .mask) calls, or take advantage of the np.select function instead.
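
As a quick refresher (a toy Series, not from the dataset): .where keeps values where the condition is True and replaces the rest, while .mask does the opposite.

Code
s = pd.Series([1, 5, 10])
s.where(s.gt(3), other=0)  # -> 0, 5, 10: values failing the condition are replaced
s.mask(s.gt(3), other=0)   # -> 1, 0, 0: values meeting the condition are replaced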

Problem statement: We want to identify the top 3 and top 5 most popular processor brands in smartphones. To do this, we will first create two lists, one for the top 3 brands and one for the top 5 brands. Phones whose processor is among the top 3 keep the brand name itself, the remaining top 5 brands are labelled “Top5”, and every other processor brand is categorized as “Other”.

Code
# Let's create the two lists that contain the top3 and top5 brand names
top3=df.processor_brand.value_counts().head(3).index
top5=df.processor_brand.value_counts().head(5).index
print(f'Top 3 most popular processors: {top3.tolist()}')
print(f'Top 5 most popular processors: {top5.tolist()}')
Top 3 most popular processors: ['snapdragon', 'helio', 'dimensity']
Top 5 most popular processors: ['snapdragon', 'helio', 'dimensity', 'exynos', 'bionic']
Code
'''
Here's an example that employs two successive .where calls:
the first call checks whether the brand is in the top 3; if not, it assigns the label "Top5".
The second call then checks whether the brand is in the top 5; if not, it assigns the category "Other".
As you can see, the logic can become intricate and difficult to grasp, especially when dealing with numerous conditions,
making the code cumbersome and hard to manage.
'''
(df
 .assign(frequency=lambda df: df.processor_brand
         .where(df.processor_brand.isin(top3), other = 'Top5')
         .where(df.processor_brand.isin(top5), other = 'Other')
        )
 .loc[:, 'frequency']
 .value_counts()
)
snapdragon    413
helio         201
dimensity     177
Top5           95
Other          94
Name: frequency, dtype: int64
Code
'''
 Now Let's see np.select!
 It simplifies the process significantly. By providing a list of conditions we want to evaluate and their 
 corresponding values if the condition evaluates to True, we can handle multiple conditions effortlessly. 
 Additionally, we can specify a default value if none of the conditions evaluates to True, making the code 
 much more straightforward and easier to manage. 
'''
(df
 .assign(frequency=lambda df: np.select(condlist=[df.processor_brand.isin(top3), df.processor_brand.isin(top5)],
                                        choicelist=[df.processor_brand,'Top5'],
                                        default='Other'
                                       )
        )
 .loc[:, 'frequency']
 .value_counts()
)
snapdragon    413
helio         201
dimensity     177
Top5           95
Other          94
Name: frequency, dtype: int64

Tip #3: Filter rows or columns with the combination of .loc and lambda

Some experienced Pandas users might consider the following concept trivial, but it was an eye-opener for me when reading the book. It turns out that combining .loc with lambda (or any custom function) lets us filter rows, columns, or both, depending on our specific needs.

Problem statement: We are interested in identifying phones with a battery capacity greater than 5000mAh.

Code
(df
 .loc[lambda df: df.battery_capacity.gt(5000), ['model', 'battery_capacity']] # here we use .gt() to select values greater than 5000
 .sort_values(by='battery_capacity')
)
model battery_capacity
70 Google Pixel 6 Pro (12GB RAM + 256GB) 5003.0
69 Google Pixel 6 Pro 5003.0
977 Xiaomi Redmi Note 9 Pro Max 5020.0
922 Xiaomi Redmi Note 10 Lite 5020.0
923 Xiaomi Redmi Note 10 Lite (4GB RAM + 128GB) 5020.0
... ... ...
624 Samsung Galaxy F63 7000.0
411 Oukitel WP9 8000.0
410 Oukitel WP21 9800.0
409 Oukitel WP19 21000.0
58 Doogee V Max 22000.0

113 rows × 2 columns
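
The same .loc-plus-lambda trick works on the column axis as well. As a small sketch (my own example, not from the book), we can keep only the camera-related columns without typing them all out:

Code
(df
 .loc[:, lambda df: df.columns.str.contains('camera')] # boolean mask over the column labels
 .head()
)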

Tip #4: Rename multiple columns effortlessly with rename and replace

OK, this is a big one for me; I have already used it multiple times. The title pretty much says it all. Suppose our column names contain spaces or other unwanted characters, which makes selecting columns by attribute access pretty much impossible. Now what do we do? Well, let’s see.

Problem statement: We would like to remove all the underscores from our column names.

Code
df.columns # original column names for reference
Index(['brand_name', 'model', 'price', 'avg_rating', '5G_or_not',
       'processor_brand', 'num_cores', 'processor_speed', 'battery_capacity',
       'fast_charging_available', 'fast_charging', 'ram_capacity',
       'internal_memory', 'screen_size', 'refresh_rate', 'num_rear_cameras',
       'os', 'primary_camera_rear', 'primary_camera_front',
       'extended_memory_available', 'resolution_height', 'resolution_width'],
      dtype='object')
Code
# column names after replacing underscores
(df
 .rename(columns = lambda x: x.replace('_', ''))
 .columns
)
Index(['brandname', 'model', 'price', 'avgrating', '5Gornot', 'processorbrand',
       'numcores', 'processorspeed', 'batterycapacity',
       'fastchargingavailable', 'fastcharging', 'ramcapacity',
       'internalmemory', 'screensize', 'refreshrate', 'numrearcameras', 'os',
       'primarycamerarear', 'primarycamerafront', 'extendedmemoryavailable',
       'resolutionheight', 'resolutionwidth'],
      dtype='object')
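
The same pattern handles the spaces scenario mentioned above. A small sketch with hypothetical column names (not from this dataset), swapping spaces for underscores so attribute access works again:

Code
messy = pd.DataFrame(columns=['brand name', 'battery capacity'])
messy.rename(columns=lambda c: c.replace(' ', '_')).columns
# Index(['brand_name', 'battery_capacity'], dtype='object')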

Tip #5: Use .clip to easily remove outliers

The clip method caps values at a given lower and upper bound: anything below the lower bound is replaced by the lower bound, and anything above the upper bound by the upper bound. This makes it easy to rein in extreme values before plotting or summarizing.

Problem statement: We want to compare the price distributions of the ten most popular smartphone brands, first with all values included and then with prices clipped to the 5th-95th percentile range.
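
Before the full example, here is what clip does in isolation (a toy Series, purely for illustration):

Code
pd.Series([1, 50, 100, 9999]).clip(lower=10, upper=200)
# -> 10, 50, 100, 200
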
Code
# First, we'll identify the ten phone brands with the most handsets in our dataset
top10_brand_names = (df
                     .brand_name
                     .value_counts()
                     .head(10)
                     .index
                     .tolist()
                    )
print(top10_brand_names)

# Then we will sort them based on median price
top10_brand_names_ordered = (df
                             .loc[lambda x: x.brand_name.isin(top10_brand_names),['brand_name', 'price']]
                             .groupby('brand_name')
                             .median()
                             .sort_values(by='price')
                             .index
                             .to_list()
                            )
print(top10_brand_names_ordered)
['xiaomi', 'samsung', 'vivo', 'realme', 'oppo', 'motorola', 'apple', 'oneplus', 'poco', 'tecno']
['tecno', 'poco', 'realme', 'xiaomi', 'motorola', 'vivo', 'oppo', 'samsung', 'oneplus', 'apple']
Code
fig, axs = plt.subplots(1,2, figsize=(10,5))

# For reference, this is what our box plot looks like if we leave in the outlier values
(df
 .loc[lambda x: x.brand_name.isin(top10_brand_names),['brand_name', 'price']] # filter rows based on top10_brand_names and select columns
 .pivot(columns='brand_name',values='price') # pivot to get the brand names on the x axis later on
 .loc[:, top10_brand_names_ordered] # order the columns based on median price
 .plot
 .box(title='Top 10 most popular smartphone brands \n -outlier values included-', 
      rot=90,
      ax=axs[0]
     )
)

(df
 .loc[lambda x: x['brand_name'].isin(top10_brand_names), ['brand_name', 'price']]
 .pivot(columns='brand_name', values='price')
 .loc[:, top10_brand_names_ordered]
 .pipe(lambda df: df.assign(**{col : df[col].clip(lower=df[col].quantile(0.05), # this is called dictionary unpacking
                                                  upper=df[col].quantile(0.95))
                              for col in df.columns}))
 .plot
 .box(title='Top 10 most popular smartphone brands \n -only values between 5th and 95th percentiles included-',
      rot=90,
      ax=axs[1]
     )
)
axs[0].set(ylabel='Price in local currency')
plt.tight_layout()

Tip #6: Find corrupted entries with str.extract

How often have you encountered the situation where, for some reason, a column that is expected to only contain numerical values displays object as its dtype? This often indicates the presence of some string values mixed within the column. It would be beneficial to promptly identify all the erroneous values, correct them, and proceed with our analysis smoothly. Let’s explore what we can do in such scenarios.

Problem statement: We would like to identify any cells in a specific column that contain non-numerical values.

Code
df_bad_values = df.copy(deep=True) # let's prepare a copy of the original dataframe 

# let's modify some of the values in the price column randomly:
df_bad_values.loc[np.random.randint(0, high=df_bad_values.shape[0], size=10), 'price'] = '.'
df_bad_values.loc[np.random.randint(0, high=df_bad_values.shape[0], size=10), 'price'] = '-'
df_bad_values.loc[np.random.randint(0, high=df_bad_values.shape[0], size=10), 'price'] = '*'
df_bad_values.info() # the modified dataframe's price column now returns object as dtype
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 980 entries, 0 to 979
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   brand_name                 980 non-null    object 
 1   model                      980 non-null    object 
 2   price                      980 non-null    object 
 3   avg_rating                 879 non-null    float64
 4   5G_or_not                  980 non-null    int64  
 5   processor_brand            960 non-null    object 
 6   num_cores                  974 non-null    float64
 7   processor_speed            938 non-null    float64
 8   battery_capacity           969 non-null    float64
 9   fast_charging_available    980 non-null    int64  
 10  fast_charging              769 non-null    float64
 11  ram_capacity               980 non-null    int64  
 12  internal_memory            980 non-null    int64  
 13  screen_size                980 non-null    float64
 14  refresh_rate               980 non-null    int64  
 15  num_rear_cameras           980 non-null    int64  
 16  os                         966 non-null    object 
 17  primary_camera_rear        980 non-null    float64
 18  primary_camera_front       975 non-null    float64
 19  extended_memory_available  980 non-null    int64  
 20  resolution_height          980 non-null    int64  
 21  resolution_width           980 non-null    int64  
dtypes: float64(8), int64(9), object(5)
memory usage: 168.6+ KB
Code
# let's find the corrupted values easily:
(df_bad_values
 .price
 .str.extract(r'([^a-zA-Z])') # extracts the first non-alphabetic character; genuine numeric entries are ints, so .str returns NaN for them
 .value_counts()
)
*    10
-    10
.    10
dtype: int64
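
Once the offending tokens are known, one possible way to proceed (my own sketch, not part of the original example) is to coerce everything non-numeric to NaN with pd.to_numeric and carry on with the analysis:

Code
(df_bad_values
 .assign(price=lambda df: pd.to_numeric(df.price, errors='coerce')) # '.', '-' and '*' become NaN
 .price
 .dtype
)
# dtype('float64')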

Tip #7: Sort values based on the key parameter

The sort_values method surprised me with its versatility. Previously, I had only used its by, axis, and ascending parameters. However, Matt’s book introduced me to its key parameter, which lets us apply any function to transform the values before sorting. The only constraint is that the function must return a Series.

Problem statement: For whatever reason, we would like to sort the phone model names in ascending order based on their second letter.

Code
(df
 .iloc[:, 1:3]
 .sort_values(by='model',
              key = lambda x: x.str[1],
              ascending = True
             )
 .head(10)
)
model price
55 CAT S22 Flip 14999
697 TCL Ion X 8990
195 LG V60 ThinQ 79990
196 LG Velvet 5G 54999
197 LG Wing 5G 54999
108 iKall Z19 Pro 8099
107 iKall Z19 7999
106 iKall Z18 6799
54 BLU F91 5G 14990
413 POCO C31 (4GB RAM + 64GB) 7499
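
A more everyday use of the key parameter (my own sketch) is case-insensitive sorting, so that lower-case names such as the iKall models are not pushed behind all the upper-case brands:

Code
(df
 .loc[:, ['model', 'price']]
 .sort_values(by='model', key=lambda s: s.str.lower()) # compare lower-cased names
 .head()
)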

Tip #8: Reference an existing variable inside .query with @

Like Tip #4, this is a technique I now use frequently. Since reading Matt’s book, I have started using .query extensively to filter rows based on values, instead of relying on .loc or .iloc. If you choose to adopt .query as well, it’s essential to know that it can reference variables from the surrounding environment with the “@” prefix. This feature enhances its flexibility and makes it even more convenient in a variety of data filtering scenarios.
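
The “@” reference is not limited to scalars; it also works with containers. A small sketch with a hypothetical shortlist of brands:

Code
brands = ['apple', 'oneplus'] # any Python list (or set, Series, ...) can be referenced
df.query("brand_name in @brands").loc[:, ['brand_name', 'model']].head()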

Problem statement: Our objective is to identify phones that meet three specific criteria: being priced below the average market price, having more processor cores than the average, and possessing a battery capacity greater than the average.

Code
average_price=df.price.mean()
average_cores=df.num_cores.mean()
average_battery=df.battery_capacity.mean()

(df
 .query("(price <= @average_price) and (num_cores >= @average_cores) and (battery_capacity >= @average_battery)")
 .iloc[:,:3]
 .sort_values(by='price')
 .head()
)
brand_name model price
498 realme Realme C30 5299
179 itel itel Vision 3 (2GB RAM + 32GB) 5785
202 micromax Micromax IN 2C 5999
729 tecno Tecno Spark Go 2022 6249
499 realme Realme C30 (3GB RAM + 32GB) 6299

Tip #9: Gain more insights by using style.background_gradient

Being a visual creature, I often struggle to comprehend data quickly by examining the raw table alone. Fortunately, with the help of style.background_gradient, which works much like the heatmap function in the seaborn library, we can shade cells along a color gradient according to their values. This lets us spot trends and patterns in our data swiftly, making data analysis more intuitive and insightful.
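
A classic quick win with background_gradient (my own sketch) is shading a correlation matrix so that the strongest relationships stand out immediately:

Code
(df
 .select_dtypes('number') # keep only the numeric columns
 .corr()
 .style
 .background_gradient(cmap='vlag', axis=None) # one gradient across the whole table
 .format(precision=2)
)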

Problem statement: Our goal is to identify the overall trends related to key descriptors found among the Top 10 smartphone brands, aiming to determine which brand offers the most value for our money.

Code
(df
 .query("brand_name.isin(@top10_brand_names_ordered)") # filter rows based on top10 brands
 .groupby(['brand_name',])
 [['avg_rating', 'processor_speed', 'ram_capacity', 
   'screen_size', 'battery_capacity', 'price']]
 .mean()
 .sort_values(by='avg_rating')
 .transpose()
 .rename(columns=str.capitalize) # capitalize brand names
 .style
 .set_caption("Key descriptors of the Top 10 smartphone brands")
 .format(precision=1)
 .background_gradient(cmap = 'vlag', axis = 1)
 .set_table_styles([
    {'selector': 'td', 'props': 'text-align: center;'},
     {'selector': 'caption','props': 'font-size:1.5em; font-weight:bold;'}
     ,]
 )
)
Key descriptors of the Top 10 smartphone brands
brand_name Tecno Realme Apple Vivo Poco Samsung Oppo Xiaomi Motorola Oneplus
avg_rating 7.4 7.6 7.7 7.7 7.9 7.9 7.9 7.9 8.0 8.2
processor_speed 2.1 2.3 3.1 2.4 2.5 2.4 2.5 2.4 2.5 2.7
ram_capacity 5.4 5.7 5.3 6.7 6.1 6.5 7.5 6.4 6.1 8.2
screen_size 6.7 6.5 6.1 6.5 6.6 6.6 6.6 6.6 6.6 6.6
battery_capacity 5333.9 4903.1 3527.2 4703.7 5009.4 4917.4 4667.2 4957.6 4863.1 4759.5
price 14545.4 17461.4 95966.5 26782.4 18479.2 36843.0 29650.0 27961.1 24099.9 35858.6

Tip #10: Use .pipe to include any function in our chain

One of the most valuable lessons I learned from Effective Pandas is the importance of arranging my code in a chain. Although it may feel somewhat restrictive at first, once you overcome the initial hurdles, you’ll realize that your code becomes more readable and easier to understand. The need to invent unique names for temporary variables is completely eliminated, making coding a much happier experience.

In the chaining world, you often want to use functions that were not explicitly designed with chaining in mind. The good news is that you still can: the pipe method lets you slot any function into the chain, and as long as that function returns a Series or DataFrame, the chain can keep going. It’s a flexible solution that lets you seamlessly integrate outside functions into your workflow, making data manipulation more efficient and enjoyable.
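
pipe is not limited to plotting calls. Any function that takes the DataFrame as its first argument fits into the chain; here is a small sketch with a made-up helper (the price boundary is arbitrary):

Code
def add_price_bucket(df, boundary=30_000):
    """Label phones as 'budget' or 'premium' around an arbitrary price boundary."""
    return df.assign(price_bucket=np.where(df.price < boundary, 'budget', 'premium'))

(df
 .pipe(add_price_bucket, boundary=30_000) # extra arguments are passed through by pipe
 .loc[:, ['model', 'price', 'price_bucket']]
 .head()
)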

Problem statement: We aim to visualize the impact of RAM capacity on user satisfaction. To achieve this, we will utilize the sns.lmplot function, which plots the data and corresponding regression models for the Top 5 phone brands.

Code
top5_brand_names_ordered = df.brand_name.value_counts().head().index

with sns.axes_style("darkgrid"):
    g = (df
         .query("brand_name.isin(@top5_brand_names_ordered)") # filter rows based on top10 brands
         [['brand_name', 'avg_rating', 'ram_capacity']]
         .assign(brand_name=lambda df: df.brand_name.str.capitalize())
         .rename(columns={'brand_name':'Brand name'})
         .pipe(lambda df: sns.lmplot(data=df,
                                     x='ram_capacity',
                                     y='avg_rating',
                                     hue='Brand name',
                                     # height=4,
                                     # aspect=1.2
                                    )
              )
        )

    g.set(title='Customer satisfaction correlates with RAM capacity', 
          xlabel='RAM capacity',
          ylabel='User rating'
         )
plt.tight_layout()

Tip #10 + 1: Use the “margins” parameter of pd.crosstab to easily calculate row/column subtotals

Although I primarily use the pandas groupby machinery for data aggregation, the pd.crosstab function has an enticing feature: the margins parameter. It enables us to effortlessly calculate subtotals across rows and columns. Moreover, combined with the normalize parameter, it lets us gain even more intuition about the questions we want to answer.
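
In its simplest form (my own sketch), crosstab just counts co-occurrences, and margins=True appends the row and column subtotals:

Code
pd.crosstab(index=df['brand_name'],
            columns=df['5G_or_not'], # 0/1 flag from the dataset
            margins=True
           ).tail()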

Problem statement: Our objective is to evaluate how RAM capacity impacts user satisfaction across the Top 5 brands. Additionally, we will normalize our data to compare values comprehensively across the entire dataset.

Code
(df
 .query("brand_name.isin(@top5_brand_names_ordered)") # filter rows based on top10 brands
 .assign(brand_name=lambda df: df.brand_name.str.capitalize())
 .pipe(lambda df: pd.crosstab(index=df['ram_capacity'],
                              columns=df['brand_name'],
                              values=df['avg_rating'],
                              aggfunc='mean',
                              margins=True,
                              normalize='all'
                             )
      )
 .mul(100)
 .round(1)
)
brand_name Oppo Realme Samsung Vivo Xiaomi All
ram_capacity
2 0.0 2.7 2.8 2.7 2.7 11.5
3 2.9 2.8 2.9 2.9 2.8 12.1
4 3.1 3.2 3.2 3.2 3.2 13.5
6 3.3 3.5 3.5 3.4 3.5 14.8
8 3.7 3.7 3.8 3.7 3.8 15.7
12 3.8 3.9 3.9 3.8 3.9 16.3
16 3.8 0.0 0.0 0.0 0.0 16.2
All 20.2 19.6 20.2 19.8 20.2 100.0

I hope this article has convinced you to pick up Matt Harrison’s Effective Pandas! There are plenty more exciting ideas in the book beyond the Top 10 I’ve shared here (I didn’t even get into the fascinating time series part!). I hope you found these insights helpful and inspiring.

Happy coding 🐼💻🚀